Project 3 is due by 7:00 p.m. EST on Tuesday, December 16. As a reminder, no late pass is allowed for Project 3, as the deadline is the final assessment deadline set by the university. You must submit the project by the deadline. Please submit both the .pdf and the .Rmd file on Gradescope by the deadline.

Please do not email your work to any of the instructors in any case (unless you have no access to Gradescope at the time the assignment is due–in this case, please email the Head TA (see the syllabus for contact info) copies of both your .pdf and .Rmd files before the deadline, and upload the unaltered copies to Gradescope as soon as you regain access). We need your files to be on Gradescope in order to grade them. Thus, even if you are late, please still submit the files on Gradescope.

This assignment can be completed in groups of up to 3 students. Feel free to use Ed Discussion to recruit team members. It is okay to work by yourself if you prefer. You are welcome to get help from the instructors (either ask questions on Ed Discussion or talk to us during office hours); however, please do not post code or solutions in a public post on Ed Discussion. Please see the Academic Integrity Policy tab on Canvas for the course policy.

For problem sets and projects you may not work with a given student on more than two assignments. For example, if you work with student A on Problem set 1 and Project 1, then you cannot work with student A on any other problem sets and projects.

When working in a group it is your responsibility to make sure that you are satisfied with all parts of the report and the submission is on time (e.g., we will not entertain arguments that deficiencies are the responsibility of other group members). We expect that you each work independently first and then compare your answers with each other once you all finish, or help each other if someone in your group gets stuck. Failing to make contributions and then putting your name on a report will be considered a violation of the honor code. Please do not divide work among group mates. Everyone in your group is responsible for being able to explain the solutions in the report.

For all parts of this project, you MUST use R commands to print the output as part of your R Markdown file. You are not permitted to find the answer in the R console and then copy and paste the answer into your report. All code and outputs should be visible on your pdf file.

If you are completing this project in a group, please have only one person in your group turn in the .Rmd and .pdf files. Make sure that this person selects the names of all the group mates on Gradescope–if you are not sure about how to make a group submission please see the video posted under Assignments > Precept submission instructions on Canvas. To receive full credit for this assignment, you must select the pages properly for each problem on Gradescope; again, if you are not sure about how to do this, please watch the video mentioned.

Late problem sets and projects (i.e., being late even after you use a late pass) will be penalized at intervals rounded up to multiples of 24 hours. You will receive a 10% deduction on the score for each 24-hour late interval. For example, if you are 3 hours late, 10% is deducted; if you are 30 hours late, 20% is deducted. A late pass is not allowed for assignments that are due on Dean's date.


Please type your name(s) after “Digitally signed:” below the honor pledge to serve as digital signature(s).

I pledge my honor that I have not violated the honor code when completing this assignment.

Digitally signed: Cindy Ruoheng Li


When reporting numerical findings in the narrative part of your report, please round answers to an appropriate number of digits. This means that usually having 2 decimal places (e.g., 45.71) is good; however, if the answer has a lot of leading zeros (e.g., 0.000034), you might want to include at least 2 non-zero significant digits. Just make sure that you do not report answers in long strings of digits (e.g., 5.043289); this will make your report look unprofessional. When printing out a data frame as part of an answer, you can use print.data.frame(your_df, digits = 2); this will print out the entire data frame your_df and suggest to R that you want at least 2 significant digits–you might want to adjust the value for digits based on individual cases. Remember not to round intermediate calculations and please avoid hard-coding. For example, if we updated the dataset house_data_cleaned.csv with additional rows, your code should still work without any modifications.
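As a quick sketch of the rounding guidance above (using a made-up data frame, not the course dataset), signif() keeps a fixed number of significant digits, which handles answers with leading zeros gracefully, and print.data.frame() accepts a digits hint:

```r
# Hypothetical data frame for illustration only
your_df <- data.frame(rate = c(0.000034189, 5.043289, 45.7132))

# signif() keeps significant digits, so small numbers stay informative:
signif(0.000034189, 2)   # 3.4e-05
signif(45.7132, 4)       # 45.71

# Suggest (not force) at least 2 significant digits when printing:
print.data.frame(your_df, digits = 2)
```

Note that digits is only a suggestion to R's print method; it may show more digits when needed to distinguish values in a column.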


Report quality

Unless otherwise specified, please provide code and the output of the code to support all your answers.

In general, if your report contains plots, in order to receive full credit, please have sensible titles and axis labels for all your graphs and adjust the values of all the relevant graphical parameters so that your plots are informative. Also, all answers must be written in complete sentences.

(-3 pts each: code does not run; code does not have proper annotations; student did not select pages correctly for each question on Gradescope)

(-2 pts each: answers are not in complete sentences; report includes an unnecessarily large chunk of a dataset.)

Before you start: for-loops are not allowed for this project.


Objective of this project

(Hypothetical) Congrats! You have been hired as an intern at Redfin (Redfin.com) in Seattle. For your first project, your manager would like you to build a new model to predict house prices in the King County area of Washington State.

To view the predicted sold prices by Redfin’s current model, refer to the number displayed in the Redfin Estimate for … section. For instance, for this house (https://www.redfin.com/NJ/Princeton/17-Aiken-Ave-08540/home/36691957), as of Nov. 29, 2025, the Redfin estimate is $1,129,552 compared to the asking price of $1.145 million.

Background info and the dataset

For this project we will use a dataset1 that contains transactional information for houses sold between May 2014 and May 2015 in King County of Washington, including Seattle.

The data for this project is stored in the file house_data_cleaned.csv, and the definitions for the variables in the dataset are listed below.

Question 1 (4 pts: 2 for renov_Y.N, 2 for the rest)

Read the dataset house_data_cleaned.csv into R and name it house. house should have 21605 rows and 15 columns.

Convert the following variables to categorical variables: waterfront, floors, condition, and zipcode. (I have checked and there is not much linear relationship between the numeric version of floors and price, and between the numeric version of condition and price. Therefore, it will be better to treat these variables as categorical variables.)

Also, add an additional column renov_Y.N to house. renov_Y.N should be a categorical variable. An element in renov_Y.N is Y if the corresponding element in yr_renovated is greater than zero, and N otherwise. The function dplyr::if_else() might be helpful here–you can check out its help manual–but you are not required to use it.
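As a hedged illustration on a toy vector (not the course data), dplyr::if_else() maps a condition to two values and, unlike base ifelse(), checks that both branches have the same type:

```r
library(dplyr)

# Toy vector: 0 means the house was never renovated
yr <- c(0, 1991, 0, 2004)

# if_else() is type-stable: both branches must be character here
if_else(yr > 0, "Y", "N")
# [1] "N" "Y" "N" "Y"
```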

Print the summary statistics for each column in the data frame house (i.e., the five-number summary and mean for numeric columns, and the frequency count for each level of categorical columns) after completing all the modifications above. (Hint: This can be done in a single line of code.)

Answer

house <- read.csv("house_data_cleaned.csv")  # file assumed to be in the working directory
dim(house)
[1] 21605    15

#convert variables to categorical variables
house$waterfront <- as.factor(house$waterfront)
house$floors <- as.factor(house$floors)
house$condition <- as.factor(house$condition)
house$zipcode <- as.factor(house$zipcode)

# Add renov_Y.N column 
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
house <- house |>
  mutate(renov_Y.N = factor(yr_renovated > 0, labels = c("N", "Y")))

#summary of each column 
house_summary <- summary(house)
print(house_summary)
     date               price            bedrooms     
 Length:21605       Min.   :  75000   Min.   : 0.000  
 Class :character   1st Qu.: 322000   1st Qu.: 3.000  
 Mode  :character   Median : 450000   Median : 3.000  
                    Mean   : 540090   Mean   : 3.371  
                    3rd Qu.: 645000   3rd Qu.: 4.000  
                    Max.   :7700000   Max.   :11.000  
                                                      
   bathrooms      sqft_living       sqft_lot       floors     
 Min.   :0.000   Min.   :  370   Min.   :    520   1  :10677  
 1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040   1.5: 1910  
 Median :2.250   Median : 1910   Median :   7620   2  : 8238  
 Mean   :2.115   Mean   : 2080   Mean   :  15109   2.5:  161  
 3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688   3  :  612  
 Max.   :8.000   Max.   :13540   Max.   :1651359   3.5:    7  
                                                              
 waterfront condition     grade          sqft_above  
 0:21442    1:   29   Min.   : 3.000   Min.   : 370  
 1:  163    2:  172   1st Qu.: 7.000   1st Qu.:1190  
            3:14026   Median : 7.000   Median :1560  
            4: 5678   Mean   : 7.657   Mean   :1788  
            5: 1700   3rd Qu.: 8.000   3rd Qu.:2210  
                      Max.   :13.000   Max.   :9410  
                                                     
 sqft_basement       yr_built     yr_renovated    
 Min.   :   0.0   Min.   :1900   Min.   :   0.00  
 1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.00  
 Median :   0.0   Median :1975   Median :   0.00  
 Mean   : 291.6   Mean   :1971   Mean   :  84.43  
 3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.00  
 Max.   :4820.0   Max.   :2015   Max.   :2015.00  
                                                  
    zipcode      renov_Y.N
 98103  :  601   N:20691  
 98038  :  590   Y:  914  
 98115  :  583            
 98052  :  574            
 98117  :  553            
 98042  :  548            
 (Other):18156            

Exploring the relationship between price and other variables.

Question 2 (30 pts)

Part a (7 pts: 3 for the answer; 4 for the graph)

Is the distribution of the variable price symmetric or skewed? If it is skewed, does it have a long left or right tail? Please provide 1 graph to support your statement.

Answer

library(ggplot2)

#calculate numerical values to measure skewness 
price_summary <- summary(house$price)
print(price_summary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  75000  322000  450000  540090  645000 7700000 

#I expect price to be right-skewed with a long right tail because the mean
#is well above the median: a small number of very expensive houses pull the
#mean upward

#Graph Justification 
Graph_justification <-  ggplot(house, aes(x = price)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 50, 
                 fill = "steelblue", 
                 alpha = 0.7, 
                 color = "white") +
  geom_density(color = "darkred", linewidth = 1) +
  labs(title = "Distribution of House Prices",
       x = "Price ($)",
       y = "Density") 
print(Graph_justification)


#The graph confirms my speculation: the distribution of price has a long
#right tail

Part b (4 pts: 2 for the number; 1 for the dimensions; 1 for creating majority)

To ensure the quality of our analysis, we will exclude houses that have extreme sold prices, i.e., houses with prices greater than or equal to $4,000,000 in our analysis. How many houses will we exclude, out of 21605 houses?

Name the dataset with house prices less than $4,000,000 majority. What are the dimensions of majority?

Also, replace the values in the price column in majority with the natural logarithm of the prices. The goal of doing the logarithmic transformation is to make the data fit the assumption of the model better when we build linear models later.
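A small simulation (lognormal toy data, not the housing prices) sketches why the log transform helps: it pulls in the long right tail, leaving a roughly symmetric distribution that better suits linear-model assumptions.

```r
set.seed(1)
x <- exp(rnorm(10000, mean = 13, sd = 0.5))  # right-skewed, like raw prices

mean(x) > median(x)        # TRUE: the heavy right tail pulls the mean up

y <- log(x)
abs(mean(y) - median(y))   # close to 0: roughly symmetric after the log
```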

Answer

#count the number of houses to exclude
houses_to_exclude <- sum(house$price >= 4000000)
#out of 21605 houses, 12 houses will be excluded 

#create the majority dataset 
library(dplyr)
majority <- house |>
  filter(price < 4000000)
dim(majority)
[1] 21593    16
#majority has 21593 rows and 16 columns

#Replace price column with the natural logarithm of the prices 
majority$price <- log(majority$price)

#Graphic result of the log conversion of the majority data set 
ggplot(majority, aes(x = price)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 50, 
                 fill = "steelblue", 
                 alpha = 0.7, 
                 color = "white") +
  geom_density(color = "darkred", linewidth = 1) +
  labs(title = "Distribution of House Prices (Natural Log)",
       x = "Price ($)",
       y = "Density") 

Part c (19 pts: 12 for the plots; 3 for identifying the pair; 4 for explanation)

Use ggpairs() in the GGally package to make matrix plots. Investigate the pairwise relationships between the variables; give different colors to the data points depending on whether the house has a view to a waterfront. Note that due to the large number of variables, you will need to use several matrix plots since one matrix plot will not fit all the variables. Make sure all the relevant information that you want to display on the matrix plots is legible.

Include sqft_living, sqft_lot, sqft_above and sqft_basement in the same plot to see if any of these variables are correlated. Among these four variables, which pair has the largest correlation coefficient? In linear modeling, is it good to include predictors that are highly correlated with one another (i.e., with correlations close to -1 or 1) in a model? Explain why or why not.

Since we have limited time for this project I have investigated the relationship between the month of sale and price and did not see any patterns; therefore, we will not include the variable date in our model or the matrix plots.

Also, note that including zipcode in a scatterplot matrix will not be helpful since the graph will be too small to show the relationship between price and zipcode; thus, we will investigate the relationship between price and zipcode separately later in the report.

Hints: If you are confused about this part, consider the following:

  • The matrix plots are intended to help prepare for the model building in the later part of the project. Think about which pairs of variables you would like to examine in order to explore their pairwise relationships for the potential linear models.

  • Based on the variable definitions, the variables that could potentially be strongly correlated should be placed in the same matrix plot. Why? There is more than one valid way to group the variables for the graphs, so be sure to explain the rationale for your choices in your report. Also note that some variables may appear in more than one matrix plot.

Answer

#Create the first GGpairs Graph: with `sqft_living`, `sqft_lot`, `sqft_above` 
#and `sqft_basement`
library(GGally)
library(ggplot2)

# Select the variables of interest
selected_vars <- majority[, c("sqft_living", "sqft_lot", "sqft_above",
                              "sqft_basement", "waterfront")]

# Create the matrix plot with colors based on waterfront
pairs_plot <- ggpairs(
  selected_vars,
  columns = 1:4, 
  mapping = aes(color = waterfront, alpha = 0.7),
  title = "Pairwise Relationships of Square Footage Variables",
  upper = list(continuous = wrap("cor", size = 4)),
  lower = list(continuous = wrap("points", size = 0.5)),
  diag = list(continuous = wrap("densityDiag", alpha = 0.5))
) 

print(pairs_plot)

#Among these four variables, sqft_living and sqft_above have the largest
#correlation coefficient, with a value close to 0.9.
#No, it is generally not good to include highly correlated predictors
#(multicollinearity) in a linear model because it inflates the standard
#errors of the coefficient estimates. So sqft_living and sqft_above should
#not both be included in the model
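The standard-error inflation from multicollinearity can be demonstrated with simulated data (a sketch, not the housing dataset): adding a near-copy of a predictor blows up the standard error of its coefficient estimate.

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is almost a copy of x1
y  <- 1 + 2 * x1 + rnorm(n)

se_alone    <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_together <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]

se_together / se_alone           # many times larger with both predictors
```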

For the remaining variables, I group them into two matrix plots: the first based on structure and size, the second on construction details and age.


library(dplyr)
library(GGally)
library(ggplot2)

# Matrix 1: Structural and Size Variables
structural_vars <- c("price", "sqft_living", "sqft_lot", "bedrooms", "bathrooms",
                     "floors", "grade", "waterfront")

# Plot 1: Correlation matrix for continuous variables
matrix_structural <- ggpairs(
  majority[, structural_vars],
  mapping = aes(color = waterfront, alpha = 0.6),
  title = "Pairwise Relationships: Structural and Size Variables",
  upper = list(
    continuous = wrap("cor", size = 2),
    discrete   = wrap("count", size = 1.5)         
  ),
  lower = list(
    continuous = wrap("points", size = 0.05),       
    combo      = wrap("box_no_facet",
                      linewidth = 0.2,
                      outlier.shape = NA),
    discrete   = wrap("count", size = 1.5)          
  ),
  diag = list(
    continuous = wrap("densityDiag", alpha = 0.25),
    discrete   = wrap("barDiag")
  )
) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.text.y = element_text(size = 8)
  )



# Display the plot
print(matrix_structural)


#price has high correlations with sqft_living and bathrooms, a moderate
#correlation with bedrooms, and a low correlation with sqft_lot.
#price has some correlation with floors, though not a strong one.
#price has strong relationships with both grade and waterfront.

#Among the predictors, sqft_living is strongly correlated with bedrooms,
#bathrooms, and grade, and bedrooms is strongly correlated with bathrooms

library(dplyr)
library(GGally)
library(ggplot2)

# Matrix 2: Construction Details & Age
construction_vars <- c("price", "sqft_above", "sqft_basement" ,
                         "yr_built", "yr_renovated","condition","renov_Y.N" )


# Plot 1: Correlation matrix for continuous variables
matrix_construction <- ggpairs(
  majority[, c(construction_vars, "waterfront")],
  mapping = aes(color = waterfront, alpha = 0.6),
  title = "Pairwise Relationships: Construction Details & Age",
  upper = list(
    continuous = wrap("cor", size = 2),
    discrete   = wrap("count", size = 1.5)
  ),
  lower = list(
    continuous = wrap("points", size = 0.05),
    combo      = wrap("box_no_facet", linewidth = 0.2, outlier.shape = NA),
    discrete   = wrap("count", size = 1.5)
  ),
  diag = list(
    continuous = wrap("densityDiag", alpha = 0.25),
    discrete   = wrap("barDiag")
  )
) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.text.y = element_text(size = 8)
  )

# Display the plot
print(matrix_construction)



#price has a strong correlation with sqft_above and some correlation with
#sqft_basement.
#price has no strong correlation with yr_built or yr_renovated.
#price also has no strong correlation with condition or with whether the
#house has been renovated (renov_Y.N)

Question 3 (15 pts) Zipcode variable

Part a (7 pts; 1 for the answer, 3 for the plot, 3 for the explanation)

Make side by side boxplots to compare the values in price by zip code. Based on your boxplots, is it good to include the dummy variables for some of the zip codes in your model? Explain.

Answer

library(ggplot2)

ggplot(majority, aes(x = zipcode, y = price)) +
  geom_boxplot(fill = "lightblue", alpha = 0.7, outlier.size = 0.5) +
  labs(
    title = "House Price Distribution by Zip Code",
    subtitle = "Log-transformed prices",
    x = "Zip Code",
    y = "log(Price)"
  )+
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 8)
  )


#Yes, it is good to include dummy variables for some of the zip codes:
#the boxplots show clear differences in median log price across zip codes,
#which makes sense because location has a huge impact on house prices.
#However, including a dummy for every zip code may not be necessary: some
#zip codes have only a few observations, and zip codes with similar price
#distributions add little information, so including too many dummies risks
#overfitting the model

Part b (8 pts; 1 for each answer, 3 for each interpretation; no credit for correct answer with a wrong explanation)

Consider the following model (you will need to remove eval=F to run the code below):

summary(lm(price~factor(zipcode), data = majority))

Call:
lm(formula = price ~ factor(zipcode), data = majority)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.73975 -0.22711 -0.02209  0.19621  2.05618 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          12.49312    0.01893 660.132  < 2e-16 ***
factor(zipcode)98002 -0.15542    0.03175  -4.896 9.87e-07 ***
factor(zipcode)98003  0.04673    0.02863   1.632 0.102689    
factor(zipcode)98004  1.50075    0.02775  54.085  < 2e-16 ***
factor(zipcode)98005  1.06698    0.03358  31.772  < 2e-16 ***
factor(zipcode)98006  1.06588    0.02487  42.865  < 2e-16 ***
factor(zipcode)98007  0.79309    0.03571  22.210  < 2e-16 ***
factor(zipcode)98008  0.78519    0.02855  27.503  < 2e-16 ***
factor(zipcode)98010  0.35845    0.04063   8.822  < 2e-16 ***
factor(zipcode)98011  0.58114    0.03196  18.185  < 2e-16 ***
factor(zipcode)98014  0.40488    0.03743  10.818  < 2e-16 ***
factor(zipcode)98019  0.42157    0.03223  13.081  < 2e-16 ***
factor(zipcode)98022  0.10559    0.03018   3.499 0.000468 ***
factor(zipcode)98023  0.01930    0.02485   0.777 0.437281    
factor(zipcode)98024  0.63356    0.04443  14.258  < 2e-16 ***
factor(zipcode)98027  0.75938    0.02592  29.294  < 2e-16 ***
factor(zipcode)98028  0.50656    0.02855  17.744  < 2e-16 ***
factor(zipcode)98029  0.78602    0.02759  28.494  < 2e-16 ***
factor(zipcode)98030  0.07563    0.02938   2.574 0.010053 *  
factor(zipcode)98031  0.09133    0.02884   3.167 0.001545 ** 
factor(zipcode)98032 -0.09494    0.03732  -2.544 0.010964 *  
factor(zipcode)98033  0.98826    0.02565  38.522  < 2e-16 ***
factor(zipcode)98034  0.58310    0.02440  23.897  < 2e-16 ***
factor(zipcode)98038  0.26728    0.02403  11.124  < 2e-16 ***
factor(zipcode)98039  1.91159    0.05576  34.283  < 2e-16 ***
factor(zipcode)98040  1.39860    0.02863  48.843  < 2e-16 ***
factor(zipcode)98042  0.10530    0.02437   4.320 1.57e-05 ***
factor(zipcode)98045  0.42067    0.03071  13.697  < 2e-16 ***
factor(zipcode)98052  0.84223    0.02415  34.869  < 2e-16 ***
factor(zipcode)98053  0.86400    0.02604  33.177  < 2e-16 ***
factor(zipcode)98055  0.08601    0.02899   2.967 0.003013 ** 
factor(zipcode)98056  0.36482    0.02601  14.025  < 2e-16 ***
factor(zipcode)98058  0.22973    0.02534   9.064  < 2e-16 ***
factor(zipcode)98059  0.53138    0.02519  21.097  < 2e-16 ***
factor(zipcode)98065  0.63305    0.02787  22.716  < 2e-16 ***
factor(zipcode)98070  0.52954    0.03813  13.888  < 2e-16 ***
factor(zipcode)98072  0.69931    0.02884  24.248  < 2e-16 ***
factor(zipcode)98074  0.88870    0.02552  34.821  < 2e-16 ***
factor(zipcode)98075  1.04285    0.02680  38.910  < 2e-16 ***
factor(zipcode)98077  0.87024    0.03180  27.367  < 2e-16 ***
factor(zipcode)98092  0.17242    0.02695   6.397 1.62e-10 ***
factor(zipcode)98102  1.04484    0.04017  26.012  < 2e-16 ***
factor(zipcode)98103  0.72852    0.02394  30.426  < 2e-16 ***
factor(zipcode)98105  1.06560    0.03038  35.079  < 2e-16 ***
factor(zipcode)98106  0.13466    0.02728   4.937 8.01e-07 ***
factor(zipcode)98107  0.73387    0.02906  25.257  < 2e-16 ***
factor(zipcode)98108  0.24212    0.03245   7.460 8.96e-14 ***
factor(zipcode)98109  1.09223    0.03930  27.793  < 2e-16 ***
factor(zipcode)98112  1.28615    0.02896  44.408  < 2e-16 ***
factor(zipcode)98115  0.78480    0.02408  32.589  < 2e-16 ***
factor(zipcode)98116  0.77478    0.02739  28.292  < 2e-16 ***
factor(zipcode)98117  0.72404    0.02433  29.759  < 2e-16 ***
factor(zipcode)98118  0.34882    0.02475  14.092  < 2e-16 ***
factor(zipcode)98119  1.06679    0.03257  32.753  < 2e-16 ***
factor(zipcode)98122  0.79002    0.02836  27.862  < 2e-16 ***
factor(zipcode)98125  0.50287    0.02595  19.377  < 2e-16 ***
factor(zipcode)98126  0.40257    0.02690  14.968  < 2e-16 ***
factor(zipcode)98133  0.34112    0.02491  13.695  < 2e-16 ***
factor(zipcode)98136  0.65367    0.02915  22.424  < 2e-16 ***
factor(zipcode)98144  0.66692    0.02711  24.598  < 2e-16 ***
factor(zipcode)98146  0.16194    0.02841   5.700 1.21e-08 ***
factor(zipcode)98148  0.02086    0.05125   0.407 0.683969    
factor(zipcode)98155  0.37953    0.02547  14.901  < 2e-16 ***
factor(zipcode)98166  0.41984    0.02945  14.257  < 2e-16 ***
factor(zipcode)98168 -0.14866    0.02896  -5.133 2.88e-07 ***
factor(zipcode)98177  0.80536    0.02941  27.380  < 2e-16 ***
factor(zipcode)98178  0.06626    0.02918   2.270 0.023195 *  
factor(zipcode)98188  0.02549    0.03618   0.705 0.481122    
factor(zipcode)98198  0.03893    0.02863   1.359 0.174019    
factor(zipcode)98199  0.99399    0.02768  35.913  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3596 on 21523 degrees of freedom
Multiple R-squared:  0.5298,    Adjusted R-squared:  0.5283 
F-statistic: 351.4 on 69 and 21523 DF,  p-value: < 2.2e-16

What is the estimate for the y-intercept of the model? Interpret the meaning of the y-intercept. What is the coefficient estimate for the dummy variable for zip code 98004? interpret this number too. It is okay to give your explanation in terms of the log(price) instead of the original price variable.

Answer The estimated y-intercept is 12.49312. It represents the mean log(price) for houses in the reference zip code, 98001 (the level absorbed into the intercept). The coefficient estimate for the dummy variable for zip code 98004 is 1.50075: on average, log(price) is 1.50075 higher for houses in zip code 98004 than for houses in the reference zip code 98001. (Equivalently, the mean log price in 98004 is 12.49312 + 1.50075 = 13.99387.)
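This interpretation can be checked on a toy example (made-up numbers, not the housing data): in lm(y ~ factor(g)), the intercept is the mean response for the reference level, and each dummy coefficient is that group's mean minus the reference mean.

```r
g <- factor(c("98001", "98001", "98004", "98004"))
y <- c(12.4, 12.6, 13.9, 14.1)    # pretend log prices

fit <- lm(y ~ g)
coef(fit)
# (Intercept)      g98004
#        12.5         1.5        # 12.5 = mean of 98001; 1.5 = 14.0 - 12.5
```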

Divide data into non-test set and two test sets.

For this part you just need to remove the eval=F argument for each of the code chunks below and run the code; no action is required from you other than that. Please make sure that you understand how each part is created.

majority covers the period from May 2014 to May 2015. We will prepare a vector date.format that we can use to extract out the observations that correspond to transactions closed in May 2015.

date.format = format(as.Date(majority$date, '%Y%m%dT'), "%Y%m")
length(date.format)
[1] 21593
head(date.format)
[1] "201410" "201412" "201502" "201412" "201502" "201405"
dim(majority)
[1] 21593    16

Then, the following lines will extract out all the observations with transactions closed in May 2015. We will use these observations for our out-of-time test set test2.


test2 = majority |> 
          filter(date.format %in% '201505') |> 
          select(-date)

dim(test2)
[1] 645  15

For the rest of the observations run the following code chunk; this will split the remaining observations into 2 sets: 85% of the data for the non-test set, and 15% of the data for the in-time test set (i.e., test1).

# remove the `date` column from the data frame
# keep the rows that do not correspond to the period May 2015
library(tidymodels) 
── Attaching packages ───────────────────── tidymodels 1.4.1 ──
✔ broom        1.0.9     ✔ rsample      1.3.1
✔ dials        1.4.2     ✔ tailor       0.1.0
✔ infer        1.0.9     ✔ tidyr        1.3.1
✔ modeldata    1.5.1     ✔ tune         2.0.1
✔ parsnip      1.3.3     ✔ workflows    1.3.0
✔ purrr        1.1.0     ✔ workflowsets 1.1.1
✔ recipes      1.3.1     ✔ yardstick    1.3.2
── Conflicts ──────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
tmp = majority |> 
          filter(!date.format %in% '201505') |> 
          select(-date)

dim(tmp)
[1] 20948    15

set.seed(1746)
data.split = initial_split(tmp, 
                           prop=.85, strata='waterfront')

non_test = training(data.split)
test1 = testing(data.split)

dim(non_test); dim(test1)
[1] 17805    15
[1] 3143   15

# names(non_test); names(test1)
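A base-R sketch (toy data with a hypothetical rare level, not the course dataset) of what strata='waterfront' accomplishes in initial_split(): sampling 85% within each stratum so the rare level lands in both sets in similar proportions.

```r
set.seed(1746)
df <- data.frame(id = 1:1000,
                 waterfront = factor(rep(c(0, 1), c(990, 10))))

# sample 85% of the row indices within each waterfront level
idx <- unlist(lapply(split(seq_len(nrow(df)), df$waterfront),
                     function(i) sample(i, size = floor(0.85 * length(i)))))

train <- df[idx, ]
test  <- df[-idx, ]
table(train$waterfront)   # 841 zeros and 8 ones (85% of each stratum)
```

Without stratification, a plain random split could, by chance, put almost all of the 10 waterfront houses into one of the two sets.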

Model building

We are now ready to build our model! majority has been divided into 3 sets:

Question 4 (13 pts)

Here we will build an Ordinary Least Squares (OLS) regression model with all the variables available to us. (Note that “an OLS regression model” is simply another name for the regular linear regression model.)

Part a (4 pts)

Build an OLS model with all the variables available to us. Note that you might see that R returns NA for the coefficient estimate for sqft_basement. This means that sqft_basement is perfectly collinear with (i.e., an exact linear combination of) one or more variables (other than price) in the dataset. Therefore, you should explicitly exclude sqft_basement from your model. Name the output of your fitted OLS model ols. Please print all the coefficient estimates in your report.
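Why R returns NA can be reproduced on a toy dataset (simulated, not the course data): when one predictor is an exact linear combination of others, lm() drops it and reports NA for its coefficient.

```r
set.seed(7)
above    <- runif(100, 500, 3000)
basement <- runif(100, 0, 1500)
living   <- above + basement     # exact linear combination
price    <- 12 + 2e-4 * living + rnorm(100, sd = 0.1)

fit <- lm(price ~ living + above + basement)
coef(fit)["basement"]            # NA: aliased (perfectly collinear)
```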

Answer

#construct the ols model
ols <- lm(price ~ . - sqft_basement, data = non_test)
summary(ols)

Call:
lm(formula = price ~ . - sqft_basement, data = non_test)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.29230 -0.09949  0.00548  0.10552  1.18541 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.209e+01  1.771e-01  68.258  < 2e-16 ***
bedrooms     -4.086e-03  2.106e-03  -1.941 0.052333 .  
bathrooms     4.042e-02  3.462e-03  11.677  < 2e-16 ***
sqft_living   1.775e-04  4.545e-06  39.064  < 2e-16 ***
sqft_lot      7.070e-07  3.837e-08  18.423  < 2e-16 ***
floors1.5     7.832e-03  5.758e-03   1.360 0.173753    
floors2      -2.274e-02  4.846e-03  -4.692 2.73e-06 ***
floors2.5    -4.290e-02  1.762e-02  -2.434 0.014928 *  
floors3      -1.301e-01  1.081e-02 -12.038  < 2e-16 ***
floors3.5    -2.555e-01  8.544e-02  -2.990 0.002792 ** 
waterfront1   6.872e-01  1.698e-02  40.469  < 2e-16 ***
condition2    8.702e-02  4.087e-02   2.129 0.033260 *  
condition3    2.288e-01  3.761e-02   6.084 1.20e-09 ***
condition4    2.779e-01  3.762e-02   7.387 1.57e-13 ***
condition5    3.368e-01  3.785e-02   8.897  < 2e-16 ***
grade         1.132e-01  2.299e-03  49.246  < 2e-16 ***
sqft_above    5.836e-05  4.778e-06  12.215  < 2e-16 ***
yr_built     -5.910e-04  9.005e-05  -6.563 5.42e-11 ***
yr_renovated  2.894e-03  4.532e-04   6.385 1.75e-10 ***
zipcode98002 -4.958e-02  1.848e-02  -2.682 0.007318 ** 
zipcode98003  1.773e-02  1.687e-02   1.051 0.293503    
zipcode98004  1.092e+00  1.660e-02  65.779  < 2e-16 ***
zipcode98005  7.146e-01  1.968e-02  36.313  < 2e-16 ***
zipcode98006  6.409e-01  1.490e-02  43.021  < 2e-16 ***
zipcode98007  6.385e-01  2.101e-02  30.396  < 2e-16 ***
zipcode98008  6.435e-01  1.679e-02  38.337  < 2e-16 ***
zipcode98010  2.316e-01  2.345e-02   9.878  < 2e-16 ***
zipcode98011  4.433e-01  1.850e-02  23.964  < 2e-16 ***
zipcode98014  2.924e-01  2.152e-02  13.588  < 2e-16 ***
zipcode98019  3.174e-01  1.926e-02  16.479  < 2e-16 ***
zipcode98022  5.504e-02  1.765e-02   3.118 0.001821 ** 
zipcode98023 -4.004e-02  1.474e-02  -2.717 0.006595 ** 
zipcode98024  4.040e-01  2.561e-02  15.777  < 2e-16 ***
zipcode98027  4.917e-01  1.527e-02  32.203  < 2e-16 ***
zipcode98028  4.096e-01  1.689e-02  24.251  < 2e-16 ***
zipcode98029  5.790e-01  1.643e-02  35.241  < 2e-16 ***
zipcode98030  4.515e-02  1.702e-02   2.652 0.008010 ** 
zipcode98031  6.057e-02  1.693e-02   3.579 0.000346 ***
zipcode98032 -6.651e-02  2.165e-02  -3.072 0.002132 ** 
zipcode98033  7.734e-01  1.511e-02  51.178  < 2e-16 ***
zipcode98034  5.319e-01  1.436e-02  37.047  < 2e-16 ***
zipcode98038  1.664e-01  1.424e-02  11.689  < 2e-16 ***
zipcode98039  1.216e+00  3.276e-02  37.121  < 2e-16 ***
zipcode98040  8.690e-01  1.711e-02  50.787  < 2e-16 ***
zipcode98042  4.830e-02  1.439e-02   3.356 0.000792 ***
zipcode98045  3.271e-01  1.784e-02  18.333  < 2e-16 ***
zipcode98052  6.267e-01  1.426e-02  43.961  < 2e-16 ***
zipcode98053  5.779e-01  1.537e-02  37.592  < 2e-16 ***
zipcode98055  1.236e-01  1.696e-02   7.287 3.30e-13 ***
zipcode98056  3.072e-01  1.538e-02  19.968  < 2e-16 ***
zipcode98058  1.460e-01  1.496e-02   9.759  < 2e-16 ***
zipcode98059  3.301e-01  1.493e-02  22.106  < 2e-16 ***
zipcode98065  4.133e-01  1.649e-02  25.071  < 2e-16 ***
zipcode98070  3.223e-01  2.323e-02  13.876  < 2e-16 ***
zipcode98072  4.853e-01  1.713e-02  28.327  < 2e-16 ***
zipcode98074  5.462e-01  1.511e-02  36.152  < 2e-16 ***
zipcode98075  5.564e-01  1.592e-02  34.951  < 2e-16 ***
zipcode98077  4.297e-01  1.866e-02  23.026  < 2e-16 ***
zipcode98092  2.280e-02  1.600e-02   1.425 0.154233    
zipcode98102  9.174e-01  2.376e-02  38.612  < 2e-16 ***
zipcode98103  7.813e-01  1.478e-02  52.855  < 2e-16 ***
zipcode98105  9.107e-01  1.834e-02  49.642  < 2e-16 ***
zipcode98106  2.756e-01  1.619e-02  17.021  < 2e-16 ***
zipcode98107  8.052e-01  1.765e-02  45.617  < 2e-16 ***
zipcode98108  3.157e-01  1.943e-02  16.250  < 2e-16 ***
zipcode98109  9.390e-01  2.346e-02  40.034  < 2e-16 ***
zipcode98112  9.881e-01  1.758e-02  56.212  < 2e-16 ***
zipcode98115  7.869e-01  1.448e-02  54.360  < 2e-16 ***
zipcode98116  7.399e-01  1.637e-02  45.192  < 2e-16 ***
zipcode98117  7.666e-01  1.471e-02  52.106  < 2e-16 ***
zipcode98118  4.276e-01  1.495e-02  28.607  < 2e-16 ***
zipcode98119  9.500e-01  1.970e-02  48.222  < 2e-16 ***
zipcode98122  7.623e-01  1.712e-02  44.539  < 2e-16 ***
zipcode98125  5.486e-01  1.553e-02  35.326  < 2e-16 ***
zipcode98126  5.120e-01  1.616e-02  31.676  < 2e-16 ***
zipcode98133  4.247e-01  1.477e-02  28.751  < 2e-16 ***
zipcode98136  6.595e-01  1.729e-02  38.136  < 2e-16 ***
zipcode98144  6.348e-01  1.630e-02  38.951  < 2e-16 ***
zipcode98146  2.656e-01  1.672e-02  15.883  < 2e-16 ***
zipcode98148  1.589e-01  3.029e-02   5.245 1.58e-07 ***
zipcode98155  3.952e-01  1.504e-02  26.274  < 2e-16 ***
zipcode98166  3.115e-01  1.728e-02  18.024  < 2e-16 ***
zipcode98168  4.878e-02  1.703e-02   2.865 0.004179 ** 
zipcode98177  6.238e-01  1.755e-02  35.536  < 2e-16 ***
zipcode98178  1.389e-01  1.710e-02   8.124 4.80e-16 ***
zipcode98188  7.854e-02  2.123e-02   3.699 0.000217 ***
zipcode98198  6.996e-02  1.689e-02   4.142 3.46e-05 ***
zipcode98199  8.417e-01  1.664e-02  50.591  < 2e-16 ***
renov_Y.NY   -5.709e+00  9.044e-01  -6.312 2.81e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1901 on 17716 degrees of freedom
Multiple R-squared:  0.8692,    Adjusted R-squared:  0.8685 
F-statistic:  1337 on 88 and 17716 DF,  p-value: < 2.2e-16

Part b (5 pts)

What is the \(\sqrt{MSE}\) when you predict with your OLS model on the non-test set? What about on the in-time and the out-of-time test sets? Calculate these quantities by using the definition of \(\sqrt{MSE}\). You are not allowed to use any functions that have not been covered in this course to do this part.

Answer
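By definition, for \(n\) houses with observed log prices \(y_i\) and model predictions \(\hat{y}_i\),

\[
\sqrt{MSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\]

which is what each block below computes with mean() and sqrt().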

# calculate the RMSE on the non-test set
pred_non_test <- predict(ols, newdata = non_test)
mse_non_test <- mean((non_test$price - pred_non_test)^2)
sqrt_mse_non_test <- sqrt(mse_non_test)

# calculate the RMSE on the in-time test set
pred_test1 <- predict(ols, newdata = test1)
mse_test1 <- mean((test1$price - pred_test1)^2)
sqrt_mse_test1 <- sqrt(mse_test1)

# calculate the RMSE on the out-of-time test set
pred_test2 <- predict(ols, newdata = test2)
mse_test2 <- mean((test2$price - pred_test2)^2)
sqrt_mse_test2 <- sqrt(mse_test2)

# report the three values
c(non_test = sqrt_mse_non_test,
  in_time = sqrt_mse_test1,
  out_of_time = sqrt_mse_test2)

Part c (4 pts; 3 pts for the 1st question, 1 pt for the 2nd)

The code below (you will need to install the {broom} and {ggfortify} packages before running the code) creates the scatterplot for residuals vs. fitted values, and the quantile-quantile plot for the residuals. What do you expect to see if the distribution of the residuals is approximately Normal?

Also, explain what a fitted value is. For example, for a particular house with log-house-price 12.5, what does the fitted value of the log-house-price of this house mean?

broom::tidy(ols)
library(ggfortify) 
autoplot(ols, which = c(1:2), size = .2)

Answer If the distribution of the residuals is approximately Normal, I would expect the residuals to scatter randomly around zero: the points should be spread evenly about the horizontal line at zero, with no clear patterns, curves, or funnel shapes. In the Q-Q plot, the points should follow the 45-degree diagonal line closely. A fitted value is the model's prediction for an observation used to fit the model: it is obtained by plugging that observation's predictor values into the estimated regression equation, i.e., multiplying them by the OLS coefficient estimates and summing. A fitted value is therefore a predicted value. For example, for a house with log-house-price 12.5, the fitted value is the model's predicted log-house-price based on that house's characteristics, which need not equal the observed 12.5.
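A fitted value can also be computed by hand from the coefficient estimates. A minimal sketch, assuming ols is the fitted lm object from Question 4 (the row index 1 is an arbitrary illustration):

```r
# the first house's row of the design matrix (dummies already expanded)
x_1 <- model.matrix(ols)[1, ]
# fitted value = sum of coefficient estimates times the house's predictor values
fitted_manual <- sum(coef(ols) * x_1)
# matches what R reports directly (up to numerical error)
all.equal(unname(fitted_manual), unname(fitted(ols)[1]))
```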

Question 5 (38 pts)

Here we will build a linear model with the LASSO.

Part a (12 pts)

Construct a LASSO regression model using all available variables, excluding sqft_basement. Make sure that you

  • expand all the categorical variables into multiple dummies,
  • and standardize all the numeric variables.
  • However, do not remove the predictors that have low variances as we did in lecture, since we want to keep some of the important variables, such as, waterfront.

Use 10-fold cross-validation and ask R to choose 200 possible \(\lambda\) values to consider. Repeat the 10-fold cross-validation process 3 times. When splitting the dataset for the 10-fold cross-validation, please make sure that all folds have about the same proportion of houses that have a view to a waterfront. Use \(\sqrt{MSE}\) as the criterion for choosing the best \(\lambda\). The selected \(\lambda\) should be the one that corresponds to the simplest model whose \(\sqrt{MSE}\) is within one SE of the lowest \(\sqrt{MSE}\).

Show the plot for \(\sqrt{MSE}\) vs. \(\lambda\) values. Report the \(\lambda\) value that you choose and the coefficient estimates of your champion model (it is fine to just print the coefficient estimates as outputs in a code chunk). How many predictors does your champion model have? (Predictors are the X-variables in the mathematical form of your model, so the y-intercept is not a predictor.) Compare this number with that for the OLS model.

Please make sure that you print the coefficient estimates only and not other uninformative quantities, such as the NA values. (Depending on how you code this part, you might not see the NAs at all. Thus, not seeing any NAs is not an indication that something has gone wrong.) Also, please use seed 1746 for all parts that involve a random process.

Hint:

In lecture notes Ch 7.2, the output of coef() is a matrix. Suppose m is a matrix. Recall that m[1, ] extracts the first row of m and m[ ,1] extracts the first column. You should check that each extracted column is a vector. It might be helpful to inspect the values in the column vector after you extract the column of interest.
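As a toy illustration of the indexing described above (a made-up matrix, not the model output):

```r
# 2 x 3 matrix with row and column names
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
m[1, ]            # first row: a named vector c(c1 = 1, c2 = 3, c3 = 5)
m[, 1]            # first column: a named vector c(r1 = 1, r2 = 2)
is.vector(m[, 1]) # TRUE: extracting a single row/column drops to a vector
```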

Answer

library(tidymodels)
library(glmnet)
library(ggplot2)
library(dplyr)


set.seed(1746)
rec <- recipe(price ~ ., data = non_test) |>
  # remove sqft_basement from the predictors
  update_role(sqft_basement, new_role = "exclude") |>
  # in case any were missed: convert all categorical variables to factors
  step_string2factor(all_nominal_predictors()) |>
  # handle unseen levels and missing values in categorical variables
  step_novel(all_nominal_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  # expand categorical variables into one-hot dummies
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  # drop only zero-variance (constant) columns; keep low-variance
  # predictors such as waterfront
  step_zv(all_predictors(), skip = TRUE) |>
  # standardize numeric predictors
  step_normalize(all_numeric_predictors())

#Building LASSO model 
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

#Combine Preprocessing and model into one workflow 
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(lasso_spec)

#10-fold CV repeated 3 times, stratified by waterfront
set.seed(1746)
folds <- vfold_cv(non_test, v = 10, repeats = 3, strata = waterfront)

#Create a grid of 200 candidate lambda values to try
lambda_grid <- grid_regular(penalty(), levels = 200)

#Tune lambda using repeated CV and RMSE
set.seed(1746)
tuned <- tune_grid(
  wf,
  resamples = folds,
  grid = lambda_grid,
  metrics = metric_set(rmse),
  control = control_grid(verbose = FALSE)
)

#pick the simplest model (largest lambda) whose RMSE is within 1 SE of the best RMSE
best_1se <- select_by_one_std_err(tuned, metric = "rmse", desc(penalty))

# Extract the selected lambda value
best_lambda <- best_1se$penalty
best_lambda
[1] 0.001933892

# ggplot RMSE vs lambda
rmse_df <- tuned |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  arrange(penalty)

ggplot(rmse_df, aes(x = penalty, y = mean)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err), width = 0) +
  scale_x_log10() +
  labs(x = expression(lambda), y = "RMSE (sqrt(MSE))")


# Fit champion model on full non_test set
champ_fit_1 <- fit(finalize_workflow(wf, best_1se), data = non_test)


# Coefficient estimates 
beta_mat <- extract_fit_engine(champ_fit_1) |> coef(s = best_lambda)  
beta_vec <- beta_mat[, 1]                                           
names(beta_vec) <- rownames(beta_mat)

beta_nz <- beta_vec[beta_vec != 0]  
beta_nz
   (Intercept)      bathrooms    sqft_living       sqft_lot 
  1.304463e+01   2.742830e-02   1.689790e-01   2.436458e-02 
         grade     sqft_above       yr_built   yr_renovated 
  1.422791e-01   2.831093e-02  -2.711875e-02   1.058981e-02 
   floors_X1.5      floors_X2      floors_X3    floors_X3.5 
  3.392163e-03  -4.243201e-05  -1.246848e-02  -1.550334e-03 
 waterfront_X0  waterfront_X1   condition_X1   condition_X2 
 -5.665668e-02   1.539549e-12  -8.216951e-03  -1.475700e-02 
  condition_X3   condition_X5 zipcode_X98001 zipcode_X98002 
 -2.038882e-02   1.488602e-02  -5.407378e-02  -4.564817e-02 
zipcode_X98003 zipcode_X98004 zipcode_X98005 zipcode_X98006 
 -4.639385e-02   7.424417e-02   2.179173e-02   2.657530e-02 
zipcode_X98007 zipcode_X98008 zipcode_X98010 zipcode_X98014 
  1.316292e-02   2.067661e-02  -1.232678e-02  -9.111050e-03 
zipcode_X98019 zipcode_X98022 zipcode_X98023 zipcode_X98027 
 -9.397352e-03  -3.802053e-02  -7.012997e-02   4.113833e-03 
zipcode_X98028 zipcode_X98029 zipcode_X98030 zipcode_X98031 
 -2.412765e-03   1.324846e-02  -4.234110e-02  -4.098745e-02 
zipcode_X98032 zipcode_X98033 zipcode_X98034 zipcode_X98038 
 -3.756390e-02   4.394353e-02   1.162973e-02  -4.278237e-02 
zipcode_X98039 zipcode_X98040 zipcode_X98042 zipcode_X98045 
  3.406926e-02   4.536323e-02  -5.951537e-02  -9.654922e-03 
zipcode_X98052 zipcode_X98053 zipcode_X98055 zipcode_X98056 
  2.687361e-02   1.744131e-02  -3.423314e-02  -1.652151e-02 
zipcode_X98058 zipcode_X98059 zipcode_X98065 zipcode_X98070 
 -4.061023e-02  -1.426717e-02  -1.642287e-03  -5.404386e-03 
zipcode_X98072 zipcode_X98074 zipcode_X98075 zipcode_X98092 
  2.445578e-03   1.189524e-02   1.241333e-02  -5.009818e-02 
zipcode_X98102 zipcode_X98103 zipcode_X98105 zipcode_X98106 
  2.890257e-02   4.943724e-02   4.365162e-02  -2.009379e-02 
zipcode_X98107 zipcode_X98108 zipcode_X98109 zipcode_X98112 
  3.506652e-02  -1.072588e-02   3.142234e-02   5.595960e-02 
zipcode_X98115 zipcode_X98116 zipcode_X98117 zipcode_X98118 
  5.161348e-02   3.243647e-02   4.648671e-02  -2.310932e-03 
zipcode_X98119 zipcode_X98122 zipcode_X98125 zipcode_X98126 
  4.184453e-02   3.221579e-02   1.082247e-02   5.303471e-03 
zipcode_X98133 zipcode_X98136 zipcode_X98144 zipcode_X98146 
 -2.440224e-03   2.013469e-02   1.983883e-02  -1.938935e-02 
zipcode_X98148 zipcode_X98155 zipcode_X98166 zipcode_X98168 
 -1.286100e-02  -5.663881e-03  -1.271725e-02  -4.307620e-02 
zipcode_X98177 zipcode_X98178 zipcode_X98188 zipcode_X98198 
  1.630363e-02  -3.263009e-02  -2.729514e-02  -4.027804e-02 
zipcode_X98199 
  4.355356e-02 

# Number of predictors in champion model
n_pred_lasso <- sum(beta_vec[-1] != 0)
n_pred_lasso
[1] 84
#My model has 84 predictors 

# Number of predictors for OLS model 
prep_rec <- prep(rec, training = non_test)
X <- bake(prep_rec, new_data = non_test) |> select(-price)
n_pred_ols <- ncol(X)
n_pred_ols
[1] 104
#The OLS model has 104 predictors 
# the LASSO model has 20 fewer predictors than the OLS model

Part b (7 pts: 4 for making data frame plot.dat, 2 for the graph, 1 for the value for b)

We will consider additional variables to try to improve our model.

Create a data frame, plot.dat, and use it to make the scatterplot for the average of log(price) vs. the number of bedrooms for the houses in majority. From the scatterplot we can see that for houses with b or more bedrooms, on average the more bedrooms a house has, the lower the sold price would be. Provide the value of b.

Answer

library(dplyr)
library(ggplot2)

# Make the plot: average log(price) vs bedrooms in majority
plot.dat <- majority |>
  group_by(bedrooms) |>
  summarise(avg_log_price = mean(log(price), na.rm = TRUE), .groups = "drop") |>
  arrange(bedrooms)

# scatterplot
print( ggplot(plot.dat, aes(x = bedrooms, y = avg_log_price)) +
  geom_point() +
  labs(x = "Bedrooms", y = "Average log(price)") )


# b is 8: the scatterplot first increases with the number of bedrooms and
# then decreases; b is the bedroom count at which the trend turns from
# increasing to decreasing
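The value of b can also be read off programmatically rather than by eye, assuming plot.dat as constructed above:

```r
# bedroom count at which the average log(price) peaks; the decline begins there
b <- plot.dat$bedrooms[which.max(plot.dat$avg_log_price)]
b
```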

Part c (6 pts)

We will improve our model by creating an additional variable for our model to consider.

For the non-test set, add an additional factor column that indicates if a house has 8 or more bedrooms and name this column eight.plus. Also do this for each of the test sets.

Answer

library(dplyr)

non_test <- non_test |>
  mutate(
    eight.plus = factor(bedrooms >= 8, levels = c(FALSE, TRUE))
  )

test1 <- test1 |>
  mutate(
    eight.plus = factor(bedrooms >= 8, levels = c(FALSE, TRUE))
  )

test2 <- test2 |>
  mutate(
    eight.plus = factor(bedrooms >= 8, levels = c(FALSE, TRUE))
  )

Part d (8 pts)

Improve your recipe in Part a by including the interaction term for bedrooms and eight.plus, and the interaction term for sqft_living and waterfront.

(Hint: The function to add interaction terms is step_interact(). Please check out its help page. In particular, see the example on the help page that deals with dummy variables and interaction terms. Note that the order you put in the recipe steps matters. You should see how each step in the recipe affects your dataset as we did in Week 10 PLE videos (see the video labeled “Video_10.02_Recipe”) by running the bake() and prep() functions.)

Refit your LASSO model with these additional terms. Again, report the coefficient estimates of your champion model (it is fine to just print the coefficient estimates as outputs in a code chunk).

Answer

library(tidymodels)
library(glmnet)

set.seed(1746)

# Make a recipe with two interactions: 1) bedrooms:eight.plus, 2) sqft_living:waterfront

rec_int <- recipe(price ~ ., data = non_test) |>
  update_role(sqft_basement, new_role = "exclude") |>

  # ensure categoricals are factors
  step_string2factor(all_nominal_predictors()) |>
  
  # handle unseen levels and missing values in categorical variables
  step_novel(all_nominal_predictors()) |>
  step_unknown(all_nominal_predictors()) |>

  # expand categoricals
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>

  step_interact(terms = ~ bedrooms:all_of("eight.plus_TRUE.")) |>
  step_interact(terms =  ~ all_of("sqft_living"):starts_with("waterfront")) |>

  # drop only zero-variance (constant) columns; keep low-variance predictors
  step_zv(all_predictors(), skip = TRUE) |>

  # standardize numeric predictors 
  step_normalize(all_numeric_predictors())

#  LASSO workflow
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

wf_int <- workflow() |>
  add_recipe(rec_int) |>
  add_model(lasso_spec)

# Repeated stratified CV (3 repeats) on the non-test set
set.seed(1746)
folds_int <- vfold_cv(non_test, v = 10, repeats = 3, strata = waterfront)

lambda_grid <- grid_regular(penalty(), levels = 200)

set.seed(1746)
tuned_int <- tune_grid(
  wf_int,
  resamples = folds_int,
  grid = lambda_grid,
  metrics = metric_set(rmse),
  control = control_grid(verbose = FALSE)
)

# 1-SE rule (simplest within 1 SE of best RMSE)
best_1se_int <- select_by_one_std_err(
  tuned_int,
  metric = "rmse",
  desc(penalty)
)

best_lambda_int <- best_1se_int$penalty
best_lambda_int
[1] 0.001933892

# Fit champion model on full non-test set
champ_wf_int <- finalize_workflow(wf_int, best_1se_int)

set.seed(1746)
champ_fit_int <- fit(champ_wf_int, data = non_test)

#  Print coefficient estimates 
beta_mat_int <- extract_fit_engine(champ_fit_int) |> coef(s = best_lambda_int)

beta_vec_int <- beta_mat_int[, 1]
names(beta_vec_int) <- rownames(beta_mat_int)

beta_nz_int <- beta_vec_int[beta_vec_int != 0]
beta_nz_int
      (Intercept)         bathrooms       sqft_living 
     1.304463e+01      2.781840e-02      1.692849e-01 
         sqft_lot             grade        sqft_above 
     2.433777e-02      1.418882e-01      2.831339e-02 
         yr_built      yr_renovated       floors_X1.5 
    -2.736713e-02      1.053222e-02      3.292042e-03 
        floors_X2         floors_X3       floors_X3.5 
    -4.411209e-05     -1.245409e-02     -1.196059e-03 
    waterfront_X0     waterfront_X1      condition_X1 
    -5.663056e-02      1.549764e-12     -8.225782e-03 
     condition_X2      condition_X3      condition_X5 
    -1.475762e-02     -2.032407e-02      1.484141e-02 
   zipcode_X98001    zipcode_X98002    zipcode_X98003 
    -5.405622e-02     -4.563950e-02     -4.637663e-02 
   zipcode_X98004    zipcode_X98005    zipcode_X98006 
     7.439725e-02      2.179374e-02      2.662295e-02 
   zipcode_X98007    zipcode_X98008    zipcode_X98010 
     1.317231e-02      2.068411e-02     -1.231328e-02 
   zipcode_X98014    zipcode_X98019    zipcode_X98022 
    -9.099997e-03     -9.390084e-03     -3.799621e-02 
   zipcode_X98023    zipcode_X98027    zipcode_X98028 
    -7.010580e-02      4.121563e-03     -2.405545e-03 
   zipcode_X98029    zipcode_X98030    zipcode_X98031 
     1.326827e-02     -4.232922e-02     -4.097664e-02 
   zipcode_X98032    zipcode_X98033    zipcode_X98034 
    -3.754629e-02      4.395429e-02      1.164417e-02 
   zipcode_X98038    zipcode_X98039    zipcode_X98040 
    -4.277227e-02      3.405798e-02      4.536062e-02 
   zipcode_X98042    zipcode_X98045    zipcode_X98052 
    -5.948852e-02     -9.635354e-03      2.689335e-02 
   zipcode_X98053    zipcode_X98055    zipcode_X98056 
     1.745581e-02     -3.417170e-02     -1.650574e-02 
   zipcode_X98058    zipcode_X98059    zipcode_X98065 
    -4.059001e-02     -1.421588e-02     -1.652657e-03 
   zipcode_X98070    zipcode_X98072    zipcode_X98074 
    -5.379382e-03      2.458150e-03      1.191607e-02 
   zipcode_X98075    zipcode_X98092    zipcode_X98102 
     1.242021e-02     -5.007505e-02      2.899638e-02 
   zipcode_X98103    zipcode_X98105    zipcode_X98106 
     4.944545e-02      4.393791e-02     -2.002700e-02 
   zipcode_X98107    zipcode_X98108    zipcode_X98109 
     3.512592e-02     -1.072806e-02      3.142469e-02 
   zipcode_X98112    zipcode_X98115    zipcode_X98116 
     5.610629e-02      5.162770e-02      3.245046e-02 
   zipcode_X98117    zipcode_X98118    zipcode_X98119 
     4.650223e-02     -2.305599e-03      4.184549e-02 
   zipcode_X98122    zipcode_X98125    zipcode_X98126 
     3.228005e-02      1.083499e-02      5.326704e-03 
   zipcode_X98133    zipcode_X98136    zipcode_X98144 
    -2.369011e-03      2.014982e-02      1.989829e-02 
   zipcode_X98146    zipcode_X98148    zipcode_X98155 
    -1.937725e-02     -1.285833e-02     -5.647182e-03 
   zipcode_X98166    zipcode_X98168    zipcode_X98177 
    -1.271305e-02     -4.306409e-02      1.631111e-02 
   zipcode_X98178    zipcode_X98188    zipcode_X98198 
    -3.262984e-02     -2.729362e-02     -4.026095e-02 
   zipcode_X98199 eight.plus_FALSE.  eight.plus_TRUE. 
     4.355900e-02      3.732245e-03     -1.510602e-14 

pred_non <- non_test |>
  mutate(.pred = predict(champ_fit_int, non_test)$.pred)

pred_in <- test1 |>
  mutate(.pred = predict(champ_fit_int, test1)$.pred)

pred_out <- test2 |>
  mutate(.pred = predict(champ_fit_int, test2)$.pred)

rmse_non <- rmse(pred_non, truth = price, estimate = .pred)$.estimate
rmse_in  <- rmse(pred_in,  truth = price, estimate = .pred)$.estimate
rmse_out <- rmse(pred_out, truth = price, estimate = .pred)$.estimate
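For reporting in Part e, the three values can be collected into a named vector (assuming the rmse_* objects computed above):

```r
c(non_test = rmse_non, in_time = rmse_in, out_of_time = rmse_out)
```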

Part e (5 pts)

What is the \(\sqrt{MSE}\) when you predict with your Champion model on the non-test set? What about on the in-time and the out-of-time test sets?

Side note 1: If the LASSO picks most of the variables, then it is not surprising that the error rate is not improved from the error rate for the OLS model.

Side note 2: Additional improvement can be made by combining zipcodes that have similar average house prices into the same category; however, due to time constraints, we will not do that here.

Answer

The \(\sqrt{MSE}\) of the champion model is 0.1908648 on the non-test set, 0.1939395 on the in-time test set, and 0.2207851 on the out-of-time test set.

The final champion LASSO model retains most of the main structural predictors such as bathrooms, square footage measures, grade, and year built, as well as a large number of zipcode indicators. This suggests that regularization does not substantially reduce model complexity relative to OLS.

The indicator variable eight.plus has a coefficient essentially equal to zero, indicating that after controlling for other predictors, distinguishing houses with eight or more bedrooms does not provide additional predictive power.

Overall, this is consistent with the observation that the LASSO model does not significantly improve prediction accuracy relative to the OLS model, as noted in Side Note 1.


  1. This is a subset of a dataset from kaggle.com. We will use only a subset of the variables in the dataset because we only want to include variables that have clear definitions and seem relevant for house prices.↩︎